This report is divided into two main sections. The first part is devoted to unsupervised learning, using several clustering methodologies learned in class. The second part is devoted to supervised learning, using methods such as KNN, classifiers based on the Bayes rule, and logistic regression.
The data that we will be using throughout the analysis is the same as in the previous step of the project: it consists of AirBnB listings along with many characteristics about their location, owner information and services included. In the first part of the project we already preprocessed the data (imputation of missing values, selection of meaningful attributes, removal of unrealistic records, etc.). We will use this already modified data for the rest of the project.
For computational purposes, we had to take only a smaller part of the original data, since it contained more than 20k observations and our personal computers were taking more than 30 minutes to compute the silhouette of the examples. We subsampled the data randomly, keeping the proportions of the groups so that the overall characteristics stay the same.
The categorical variable of interest that we have chosen for making the true clusters is “room_type”. This variable contains 4 classes: ‘Entire home/apt’, ‘Hotel room’, ‘Private room’ and ‘Shared room’. Most of the instances are divided into entire homes or private rooms, and the other two categories combined add up to only 3.5% of the whole dataset.
We have performed the principal component analysis for the selected dataset, and plotted the first two principal components in terms of room_type. Below we see the graph.
We can see that there is not a very strong separation when we choose the grouping criterion to be room_type, maybe because there are many other characteristics that have a greater impact on the separation of the groups. This is nevertheless very interesting, since we will see, throughout the report, how the different algorithms and methods chosen make different groups out of the same data, according to their different criteria.
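The PCA step above can be sketched as follows. This is only an illustrative sketch on R's built-in iris data, with Species standing in for room_type; the variable names are for illustration, not the report's actual objects.

```r
# Minimal PCA sketch on iris (stand-in for our listings data)
X_num <- scale(iris[, 1:4])           # scale the quantitative variables
pcs <- prcomp(X_num)                  # principal component analysis
plot(pcs$x[, 1:2], pch = 19, col = as.integer(iris$Species),
     xlab = "First PC", ylab = "Second PC",
     main = "First two PCs by class")
# cumulative proportion of variance explained by the first two PCs
summary(pcs)$importance["Cumulative Proportion", 2]
```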
For the clustering analysis we therefore exclude the categorical variable of interest from the data, keeping it only as the response for the supervised classification. It is important to note that the groups obtained with the clustering may not be those that define the categorical variable of interest; we will draw conclusions from the analysis with this in mind.
The unsupervised cluster analysis can be done with several methodologies. In this report we will present the following ones:
Clustering refers to a wide range of methods for finding subgroups (clusters) in a dataset. The aim of these techniques is to group observations so that the instances assigned to a particular group are quite homogeneous, whereas observations in different groups are as different from each other as possible.
In this section we are going to focus on the family of methods belonging to partitional clustering. The idea of partitional clustering is to start with a particular partition and exchange observations until an optimal clustering structure is reached. We are going to employ the most standard partitional clustering algorithms, such as K-means, PAM and CLARA.
We start by assessing the existence of subgroups in our dataset with the K-means algorithm. For every specific number of groups selected, the algorithm tries to minimize the within-cluster variation, that is, the difference between each observation belonging to a specific group and the sample mean associated to that group. The algorithm then stops when the assignments stabilize.
We begin by assessing the optimal K, running K-means for different values of K.
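The elbow computation can be sketched as below, again on iris as a stand-in for our listings data; in the report the same loop is run on the listings.

```r
# Elbow method sketch: total within-cluster sum of squares for K = 1..10
set.seed(1)
Z <- scale(iris[, 1:4])
wss <- sapply(1:10, function(k) kmeans(Z, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b", pch = 20,
     xlab = "K", ylab = "Total within-cluster SS", main = "Elbow plot")
```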
The results exhibited by the elbow graph are unclear: the within-cluster variation decreases smoothly as K increases, which is a natural result. In order to assess the optimal number of clusters we are going to carry out a silhouette analysis. This technique provides more information regarding the goodness of fit of the clustering structure, since it delivers information not just about the distances within a given group but also with respect to other groups. Therefore, well-defined groups will yield better results than groups with weak boundaries.
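A minimal version of this silhouette search, assuming the `cluster` package (which ships with R) and using iris as a stand-in:

```r
library(cluster)                       # provides silhouette()
set.seed(1)
Z <- scale(iris[, 1:4])
d <- dist(Z, method = "euclidean")
avg_sil <- sapply(2:8, function(k) {
  cl <- kmeans(Z, centers = k, nstart = 20)$cluster
  mean(silhouette(cl, d)[, 3])         # column 3 is sil_width
})
(2:8)[which.max(avg_sil)]              # K with the highest average silhouette
```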
The results suggest that the optimal K is equal to two. Notwithstanding, the average silhouette for K=3 is really close to the one obtained for K=2, so once again the results are unclear. In order to get more insight about the optimal number of clusters we are going to employ the Gap statistic. This technique compares the within-cluster variation with its expected value under a reference distribution with no clustering (K=1).
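The Gap statistic can be computed with `cluster::clusGap`; a sketch on iris, with a small number of bootstrap reference sets (B is kept low only to make the example fast):

```r
library(cluster)
set.seed(1)
Z <- scale(iris[, 1:4])
gap <- clusGap(Z, FUNcluster = kmeans, nstart = 20, K.max = 8, B = 20)
plot(gap, main = "Gap statistic")
# K suggested by Tibshirani's "first SE max" rule
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```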
The results provided by the Gap statistic suggest K=8. However, these results are also unclear: there is no well-defined local maximum.
Since the previous results were unclear, we are going to conduct K-means with K=3: the silhouette for K=3 was really close to the optimal one, and the other techniques suggested a larger number of clusters.
The plot above shows the distribution of each cluster. Three well-differentiated groups can be seen: two streams of data, each composed of a different group, and a third cluster on top of both streams. Recall that in the original clustering structure the two streams of data are composed mainly of two groups (Private rooms and Entire homes), with a well-defined separation within each stream; in our case, however, each stream is assigned to a single group.
## Silhouette of 3125 units in 3 clusters from silhouette.default(x = kmeans_X$cluster, dist = dist(small_X[, 1:24], "euclidean")) :
## Cluster sizes and average silhouette widths:
## 854 2043 228
## 0.00852227 0.05132138 -0.14799621
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2269297 0.0001056 0.0226093 0.0250830 0.0783749 0.0982671
The average silhouette obtained for the previous case is really close to 0. These results suggest that the partitions obtained are weak.
Due to the unclear results obtained so far, we are going to carry out a clustering structure analysis with PAM and CLARA. Unlike K-means, PAM and CLARA use the most central point of a group (the medoid) instead of the sample mean. Despite being computationally more costly than K-means, this allows us to take categorical variables into consideration as well, with the use of Gower distances.
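A minimal sketch of the mixed-data workflow with Gower distances; the tiny data frame below is hypothetical, standing in for our listings.

```r
library(cluster)
# hypothetical mixed-type data: one numeric and one categorical variable
mix <- data.frame(price = c(30, 80, 45, 120, 60, 25),
                  superhost = factor(c("t", "f", "t", "f", "f", "t")))
g <- daisy(mix, metric = "gower")   # Gower handles mixed variable types
fit <- pam(g, k = 2, diss = TRUE)   # PAM on the dissimilarity matrix
fit$clustering                      # cluster assignment of each row
fit$id.med                          # indices of the medoids
```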
The results obtained with PAM differ from the ones obtained with K-means. As can be seen depicted above, one of the streams of data is assigned to one group, whereas the other stream is assigned to the other two groups. The distribution of the clusters within this second stream of data is closer to the original distribution than the one indicated by the K-means algorithm.
## Silhouette of 3125 units in 3 clusters from silhouette.default(x = pam_data$cluster, dist = dist(small_X[, 1:24], method = "manhattan")) :
## Cluster sizes and average silhouette widths:
## 1162 1072 891
## 0.19716402 -0.10062072 0.06103673
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.29361 -0.05617 0.07916 0.05620 0.20494 0.33168
As can be seen in the figure above, the average silhouette obtained for PAM with k=3 is really close to 0, as in the previous method.
As it has been previously stated, PAM also works with mixed data. For this reason, we are going to assess the existence of different clusters with the use of the quantitative variables employed so far in addition to some categorical variables such as: property type, Centro, host_is_superhost, and cancelation_policy.
By assessing the average silhouette yielded by the different number of clusters with PAM using mixed data, we observe that the optimal amount of K suggested is equal to 2.
## Silhouette of 3125 units in 2 clusters from silhouette.default(x = pam_X_Gower_mat$cluster, dist = X_Gower_mat) :
## Cluster sizes and average silhouette widths:
## 2173 952
## 0.3345737 0.3164106
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.3185 0.2903 0.3556 0.3290 0.4066 0.5054
When conducting PAM with K=2, the average silhouette yielded by the resulting clustering structure is around 0.33.
Clara is similar to PAM but designed for large data sets; we are going to carry out the analysis with it as well so we can compare results.
When conducting Clara with k=2, we observe that each stream of data is assigned to a different group.
## Silhouette of 3125 units in 2 clusters from silhouette.default(x = clara_X$cluster, dist = dist(small_X_transf, method = "manhattan")) :
## Cluster sizes and average silhouette widths:
## 2224 901
## 0.1503541 0.1308600
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.10785 0.09055 0.15970 0.14473 0.19828 0.24179
Furthermore, the average silhouette exhibited by this clustering structure is around 0.15.
Hierarchical clustering is a different approach to K-means clustering for identifying and discovering different groups within a dataset. However, in contrast to k-means, hierarchical clustering will create a hierarchy of clusters and therefore does not require us to choose the number of clusters (k) in advance. Besides this, hierarchical clustering has another added advantage over k-means clustering: its results can be easily visualized using a tree-based graph called a dendrogram.
Hierarchical clustering can be divided into two main types:
Agglomerative clustering: This method works in a bottom-up manner. That is, each observation is initially considered as a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new bigger cluster (nodes). This procedure is iterated until all points are a member of just one single big cluster. The result is a tree which can be displayed using a dendrogram.
Divisive hierarchical clustering: DIANA (DIvisive ANAlysis) works in a top-down way and can therefore be seen as the reverse of agglomerative clustering. It begins with the root, in which all observations are included in a single cluster. At each step of the algorithm, the current cluster is split into the two clusters that are considered most heterogeneous. The process is iterated until every observation has its own cluster, so the number of clusters then equals the number of instances.
Both of these two methods of hierarchical clustering will be performed on the dataset below.
Similar to k-means, the similarity of observations is measured using distance measures (Euclidean or Manhattan distance). The Euclidean distance is most commonly used, but the choice of distance does not have a strong influence in this method. On the other hand, an important question in hierarchical clustering is: how can the dissimilarity between two clusters of observations be measured? A number of different cluster linkage methods have been developed to answer this question. The most common methods are:
Complete linkage clustering: All pairwise distances between the elements in cluster 1 and the elements in cluster 2 are computed, and the largest value of these distances is taken as the distance between the two clusters. This method tends to create more compact clusters.
Single linkage clustering: All pairwise distances between the elements in cluster 1 and the elements in cluster 2 are computed, and the smallest value of these distances is taken as the linkage criterion. This method tends to create long, “loose” clusters.
Average linkage clustering: All pairwise distances between the elements in cluster 1 and the elements in cluster 2 are computed, and the average value of these distances is taken as the linkage criterion. This method can create either compact or less compact clusters; there is no general rule of thumb for this.
Ward’s minimum variance method: This method minimizes the total within-cluster variance. At each step the pair of clusters with the smallest between-cluster distance is merged. This method tends to produce more compact and equal clusters.
K = 3 # based on silhouette
# Compute distances
man_dist_X <- daisy(scale(X),metric="manhattan")
# Single linkage
single_X <- hclust(man_dist_X,method="single")
# Plot dendrogram
plot(single_X,main="Single linkage",cex=0.8)
rect.hclust(single_X,k=4,border=color_1)
cl_single_X <- cutree(single_X,4)
table(cl_single_X)
## cl_single_X
## 1 2 3 4
## 3122 1 1 1
# Plot of the first two PCs with the four clusters
colors_single_X <- c(color_1,color_2,color_3,color_4,color_5)[cl_single_X]
plot(X_pcs$x[,1:2],pch=19,col=colors_single_X,main="First two PCs",xlab="First PC",ylab="Second PC")
# Silhouette
sil_single_X <- silhouette(cl_single_X,man_dist_X)
plot(sil_single_X,col=color_1)
# Complete linkage
complete_X <- hclust(man_dist_X,method="complete")
# Plot dendrogram of the solution for k=5
plot(complete_X,main="Complete linkage",cex=0.8)
rect.hclust(complete_X,k=5,border=color_1)
cl_complete_X <- cutree(complete_X,5)
table(cl_complete_X)
## cl_complete_X
## 1 2 3 4 5
## 2936 43 68 72 6
# Plot of the first two PCs with the five clusters
colors_complete_X <- c(color_1,color_2,color_3,color_4,color_5)[cl_complete_X]
plot(X_pcs$x[,1:2],pch=19,col=colors_complete_X,main="First two PCs",xlab="First PC",ylab="Second PC")
# Silhouette
sil_complete_X <- silhouette(cl_complete_X,man_dist_X)
plot(sil_complete_X,col=color_1)
# Average linkage
average_X <- hclust(man_dist_X,method="average")
# Plot dendrogram of the solution for k=5
plot(average_X,main="Average linkage",cex=0.8)
rect.hclust(average_X,k=5,border=color_1)
cl_average_X <- cutree(average_X,5)
table(cl_average_X)
## cl_average_X
## 1 2 3 4 5
## 3097 20 6 1 1
# Plot of the first two PCs with the five clusters
colors_average_X <- c(color_1,color_2,color_3,color_4,color_5)[cl_average_X]
plot(X_pcs$x[,1:2],pch=19,col=colors_average_X,main="First two PCs",xlab="First PC",ylab="Second PC")
# Silhouette
sil_average_X <- silhouette(cl_average_X,man_dist_X)
plot(sil_average_X,col=color_1)
# Ward linkage
ward_X <- hclust(man_dist_X,method="ward.D")
# Plot dendrogram of the solution for k=5
plot(ward_X,main="Ward linkage",cex=0.8)
rect.hclust(ward_X,k=5,border=color_1)
cl_ward_X <- cutree(ward_X,5)
table(cl_ward_X)
## cl_ward_X
## 1 2 3 4 5
## 789 898 877 425 136
# Plot of the first two PCs with the five clusters
colors_ward_X <- c(color_1,color_2,color_3,color_4,color_5)[cl_ward_X]
plot(X_pcs$x[,1:2],pch=19,col=colors_ward_X,main="First two PCs",xlab="First PC",ylab="Second PC")
# Silhouette
sil_ward_X <- silhouette(cl_ward_X,man_dist_X)
plot(sil_ward_X,col=color_1)
# This solution is probably the best one among the agglomerative hierarchical clustering methods
Now the question arises: which of the agglomerative hierarchical clustering methods works best? Based on the plots shown above, one could conclude that the last method (Ward's method) yields the best results. To further support this conclusion, the Agglomerative Coefficient (AC) can be used.
# methods to assess
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
# function to compute coefficient
ac <- function(x) {
agnes(scale(X), method = x)$ac
}
# get agglomerative coefficient for each linkage method
purrr::map_dbl(m, ac)
## average single complete ward
## 0.8937960 0.8637393 0.9240589 0.9836633
The AC measures the strength of the generated clustering structure: the closer the value is to 1, the stronger the clustering structure is. The complete and Ward's linkage methods generally yield higher AC values. Values closer to 0 imply less well-formed clusters, as can be seen above in the dendrogram of the single linkage. It is important to note that the AC tends to become larger as \(n\) increases, so it should only be used to compare clusterings of the same dataset (with an equal \(n\)) and should not be used across data sets of very different sizes.
Here we see that Ward’s method has the highest AC which means that it has the strongest clustering structure of the four methods assessed.
For DIANA, clusters are divided based on the maximum average dissimilarity which is very similar to the mean or average linkage clustering method outlined above.
# Divisive hierarchical clustering
diana_X <- diana(scale(X),metric="manhattan")
# Plot dendrogram of the solution
plot(diana_X,main="DIANA")
# Press Return twice to see the dendrogram
# The heights here are the diameters of the clusters before splitting
# Take k=5
rect.hclust(diana_X,k=5,border=color_1)
cl_diana_X <- cutree(diana_X,5)
table(cl_diana_X)
## cl_diana_X
## 1 2 3 4 5
## 1164 1029 882 8 42
# Plot of the first two PCs with the five clusters
colors_diana_X <- c(color_1,color_2,color_3,color_4,color_5)[cl_diana_X]
plot(X_pcs$x[,1:2],pch=19,col=colors_diana_X,main="First two PCs",xlab="First PC",ylab="Second PC")
# Silhouette
sil_diana_X <- silhouette(cl_diana_X,man_dist_X)
plot(sil_diana_X,col=color_1)
# Divisive Coefficient
diana_X$dc
## [1] 0.9308197
The best method of the hierarchical clustering analysis seems to be the last one: the DIANA method. As can be seen above, the divisive coefficient of 0.9308197 is lower than the agglomerative coefficient of Ward's method. This can be explained, however, by the fact that Ward's method creates more equally sized clusters, which drives up the score. In reality, though, the data is imbalanced, and therefore the clusters do not need to be equal.
This method, in contrast to traditional methods such as partitional or hierarchical clustering that compute clusters directly from the data, fits a mixture of distributions to the data, attaching a measure of probability and uncertainty to each cluster assignment.
Model-based clustering tries to provide a soft-assignment, which means that observations have a probability of belonging to a specific cluster. This method has also the benefit of automatically identifying the optimal number of clusters.
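The soft assignment can be sketched as follows, assuming the mclust package is installed and again using iris as a stand-in for our data:

```r
library(mclust)
set.seed(1)
Z <- scale(iris[, 1:4])
fit <- Mclust(Z, G = 1:6)        # BIC selects the model and number of groups
fit$G                            # number of components chosen
head(round(fit$z, 3))            # P(cluster j | observation i); rows sum to 1
```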
We start by computing the BIC for the different types of covariance matrices that describe the distributions. The function mclust is able to work with up to 14 different covariance matrix configurations, and makes the computations up to the number of selected possible groups, and then selects the best value along with the best covariance.
In the end, this method selects the top 3 models that best fit the data.
# compute the BIC for up to 6 groups
BIC_X <- mclustBIC(Z,G=1:6)
BIC_X
## Bayesian Information Criterion (BIC):
## EII VII EEI VEI EVI VVI EEE
## 1 -26197.24 -26197.24 -26191.93 -26191.93 -26191.93 -26191.93 -26199.98
## 2 -25928.67 -24215.31 -24550.06 -24172.62 -24458.99 -24180.68 -24556.54
## 3 -24095.06 -22877.76 -23599.08 -22807.74 -23607.79 -22809.34 -23607.07
## 4 -23327.40 -22564.89 -23623.50 -22570.19 -23059.04 -22365.52 -23631.34
## 5 -23351.57 -22506.64 -23651.35 -22494.21 -22721.32 -22167.06 -23659.35
## 6 -23375.72 -22487.55 -23675.94 -22441.53 -22504.76 -21925.50 -23684.07
## VEE EVE VVE EEV VEV EVV VVV
## 1 -26199.98 -26199.98 -26199.98 -26199.98 -26199.98 -26199.98 -26199.98
## 2 -24059.95 -24464.98 -24023.47 -24557.68 -24057.05 -24472.26 -24028.31
## 3 -22815.24 -23615.57 -22810.75 -23612.17 -22812.11 -23612.58 -22807.59
## 4 -22578.00 -23058.84 -22363.20 -23137.39 -22378.89 -23067.53 -22376.25
## 5 -22473.19 -22726.07 -22136.90 -23099.58 -22140.52 -22867.84 -22143.88
## 6 -22418.05 -22509.84 -21945.16 -23066.10 -21982.22 -22545.54 -21944.77
##
## Top 3 models based on the BIC criterion:
## VVI,6 VVV,6 VVE,6
## -21925.50 -21944.77 -21945.16
We can see that the three best models according to the BIC are VVI (diagonal, varying volume and shape), VVV (ellipsoidal, varying volume, shape and orientation) and VVE (ellipsoidal with equal orientation), all of them with 6 components. This means that the number of clusters suggested by the model is 6.
Below we plot the graph of the results for the different covariance shapes and the resulting BIC scores. Keep in mind that mclust defines the BIC so that larger values are better, thus we select the maximum possible value. We see that the curve increases strongly up to around five or six groups, where it starts to flatten out.
Now we will run the mclust function with the optimal solution obtained in the part above.
Mclust_X <- Mclust(Z,x=BIC_X)
summary(Mclust_X)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 6 components:
##
## log-likelihood n df BIC ICL
## -10846.07 3125 29 -21925.5 -23327.27
##
## Clustering table:
## 1 2 3 4 5 6
## 477 344 631 912 651 110
Finally, we can plot the classification according to this model-based clustering approach, together with the probabilities of each observation belonging to each of the different clusters.
From the probability graphs for each of the clusters, we see that there is some uncertainty in clusters #1 and #4, since more points stay around the midline and fewer go to the extremes (0, 1).
To finish, we plot the uncertainty of the points in a similar manner as the classification plot shown above. The bigger the points are, the less certain their allocation to the current cluster is.
par(mfrow=c(1,1))
plot(Mclust_X,what="uncertainty")
To conclude, the model-based approach is a very interesting one, since it does not base the clusters on the data itself, but rather fits some possible mixture of models based on the structure of the covariance matrices of each of the distributions. The name of the chosen model is given by the characteristics and shape of the distribution: some may be ellipsoidal, others spherical, and some diagonal. The package used, “mclust”, offers a total of 14 possible multivariate mixtures.
We have seen that the multivariate mixture that best fits our case is VVI with 6 components. This consists of a mixture of six diagonal Gaussian distributions with varying volume and shape. It does not really help much in clarifying the distribution, but it gives us a general understanding that the data does contain some different clusters that must be considered.
This part of the report will show three different methods for performing supervised learning. For this type of learning, we will need a response variable, this will be ‘room_type’, just as in the previous case.
The three different methods to be explained are the following:
K-nearest-neighbours (KNN)
Classifiers based on the Bayes rule (LDA, QDA and Naive Bayes)
Logistic Regression
K-nearest-neighbours is a relatively simple algorithm in which each new observation is predicted based on its K nearest neighbours. KNN is a memory-based algorithm, since it has to store all the training data for making future comparisons, and it cannot be described in any closed form. The KNN algorithm has been shown to be very useful, but it can also be computationally inefficient.
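The core of the method can be sketched in a few lines with the `class` package; iris stands in for our listings here, and k = 5 is an arbitrary illustrative choice.

```r
library(class)                                  # provides knn()
set.seed(1)
Z <- scale(iris[, 1:4])
idx <- sample(nrow(Z), 0.7 * nrow(Z))           # 70-30 train/test split
pred <- knn(train = Z[idx, ], test = Z[-idx, ],
            cl = iris$Species[idx], k = 5)
mean(pred != iris$Species[-idx])                # test error rate
```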
We will show below the process for classifying instances according to the ‘room_type’ attribute. This attribute, as discussed above, has 4 possible categories.
We must first split the data into training and testing datasets; we have chosen a 70-30 split, which is commonly used. Below we can see the division of the groups.
## [1] "The training set has 2187.000000 observations"
## [1] "The testing set has 938.000000 observations"
## [1] "Proportions in the Training set"
## [1] "Entire home/apt"
## [1] 0.5953361
## [1] "Private room"
## [1] 0.3689986
## [1] "Hotel room"
## [1] 0.02652035
## [1] "Shared room"
## [1] 0.009144947
## [1] "-------"
## [1] "Proportions in the Testing set"
## [1] "Entire home/apt"
## [1] 0.6481876
## [1] "Private room"
## [1] 0.3144989
## [1] "Hotel room"
## [1] 0.02452026
## [1] "Shared room"
## [1] 0.01279318
We can see that the proportions of the groups in the testing and training datasets are very similar. We must scale the data before we perform any computations. In this case, we will use the library ‘class’.
stan_X_train <- scale(X_train)
stan_X_test <- scale(X_test)
LER <- matrix(NA,nrow=40,ncol=1)
for (i in 3 : 40){
knn_output <- knn.cv(stan_X_train,Y_train,k=i)
LER[i] <- 1 - mean(knn_output==Y_train)
}
plot(1:40,LER,pch=20,col=color_1,type="b",xlab="K",ylab="LER",main="LER for the AirBnB data set")
K <- which.min(LER)
K
## [1] 18
knn_Y_test <- knn(stan_X_train,stan_X_test,Y_train,k=K,prob=TRUE)
# Confusion table
table(Y_test,knn_Y_test)
## knn_Y_test
## Y_test Entire home/apt Hotel room Private room Shared room
## Entire home/apt 511 0 97 0
## Hotel room 15 0 8 0
## Private room 53 0 242 0
## Shared room 6 0 6 0
# Obtain the Test Error Rate (TER)
prob_knn_Y_test <- attributes(knn_Y_test)$prob
knn_TER <- mean(Y_test!=knn_Y_test)
knn_TER
## [1] 0.1972281
# Make a plot of the probabilities of the winner group
# In blue, good classifications, in red, wrong classifications
colors_errors <- c(color_3,color_1)[1*(Y_test==knn_Y_test)+1]
plot(1:n_test,prob_knn_Y_test,col=colors_errors,pch=20,type="p",xlab="Test sample",ylab="Winning probabilities",
main="Winning probabilities")
In this section, we are going to assess a set of classifiers that are based on the Bayes rule. Under the assumed probabilistic framework, the Bayes rule minimizes the TER. These classifiers compute the conditional probability associated to each class and predict y as the class that exhibits the highest probability.
Firstly, we are going to employ the Linear Discriminant Analysis (LDA) method. This method assumes that the class densities are Gaussian, although we have seen that it can work well in some cases where this assumption does not hold. In our case, although we tried to tackle this issue in the first part of the project, we know that some of our variables are non-Gaussian: we transformed them to push them towards a more normal distribution, but some of them are still not Gaussian.
In order to compute the probabilities we need to compute sample mean vectors and the estimated prior probabilities for each group. For this particular method, the covariance matrix is assumed to be the same across populations.
Once the conditional probabilities are computed, for any given observation in the test data set we predict its class as the one that exhibits the largest probability. The TER yielded by the method is around 0.16.
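A minimal LDA sketch with `MASS::lda`, on iris as a stand-in for our data; the posterior matrix holds the per-class conditional probabilities described above.

```r
library(MASS)                                 # provides lda()
set.seed(1)
idx <- sample(nrow(iris), 0.7 * nrow(iris))
fit <- lda(Species ~ ., data = iris[idx, ])
pred <- predict(fit, iris[-idx, ])
mean(pred$class != iris$Species[-idx])        # test error rate (TER)
head(round(pred$posterior, 3))                # posterior probability per class
```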
In the plot above we observe that the closer the observations are to the extreme probability values (0 and 1), the more accurate the predictions are. We also see a high concentration of observations properly classified as Entire homes in the range of 0.8 to 1.
In this case, we observe that the probabilities associated to each observation are really low compared to the previous case. Most of the observations with an associated probability higher than 0.5 were wrongly classified, whereas those that exhibited a lower probability were properly identified as not being hotel rooms.
Similar to the case of Entire homes, we observe that the accuracy is really high for those observations associated to extreme probability values (close to 1 and 0). Whereas most of the errors are given by observations situated in the middle of the plane.
Finally, for the case of Shared rooms, we observe that the observations associated to high probability values were properly chosen, and the majority of the observations with low probability values were properly discarded. Notwithstanding, we find numerous errors in the lowest part of the plane.
This model is similar to the previous method. However, in this case, we assume that the covariance matrices are different across groups.
Unlike the previous case, in order to be able to compute the QDA method we had to employ a dataset with just the majority classes (Entire homes and Private rooms); due to the imbalanced structure of our dataset, the computation was not feasible otherwise.
The TER obtained after running this method is 0.16.
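The QDA variant differs only in the call; again a sketch on iris. Note that `qda()` needs enough observations per class to estimate a separate covariance matrix, which is why the minority room types had to be dropped in our data.

```r
library(MASS)                                 # provides qda()
set.seed(1)
idx <- sample(nrow(iris), 0.7 * nrow(iris))
fit <- qda(Species ~ ., data = iris[idx, ])   # one covariance matrix per class
pred <- predict(fit, iris[-idx, ])
mean(pred$class != iris$Species[-idx])        # test error rate
```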
In the plot above we observe that while in the region of large probabilities there is an accumulation of properly identified classes, in the region of low probabilities we find that the majority of the observations were identified as private rooms despite being Entire homes.
The results observed above mirror the results found for the case of Entire homes. This is to say, we find many observations that were wrongly identified in the large probability region, and many observations properly identified in the low probability region.
### Naive Bayes (NB)
This method is a particular case of Quadratic Discriminant Analysis. Unlike in the previous case, we assume that the predictors are independent given the class; therefore, we do not need to compute covariances, which reduces the number of estimates to compute.
The TER exhibited by the Naive Bayes method is around 0.7, considerably higher than in the previous cases.
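Because the predictors are assumed independent given the class, Gaussian naive Bayes needs only per-class means, variances and priors. A base-R sketch of this idea on iris (not the report's actual implementation):

```r
# Gaussian naive Bayes from scratch: per class, only means, variances
# and a prior are estimated -- no covariance terms
set.seed(1)
idx <- sample(nrow(iris), 0.7 * nrow(iris))
tr <- iris[idx, ]; te <- iris[-idx, ]
classes <- levels(tr$Species)
log_post <- sapply(classes, function(cl) {
  sub <- tr[tr$Species == cl, 1:4]
  mu <- colMeans(sub); s2 <- apply(sub, 2, var)
  prior <- mean(tr$Species == cl)
  # log prior + sum of independent Gaussian log-densities
  log(prior) + apply(te[, 1:4], 1, function(x)
    sum(dnorm(x, mu, sqrt(s2), log = TRUE)))
})
pred <- classes[max.col(log_post)]
mean(pred != te$Species)                      # test error rate
```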
Unlike in the previous methods considered, we find that for Entire homes, most of the observations that were in the large probability region were properly labelled. Whereas the majority of instances that were located in the low region were wrongly selected as Entire homes.
In the graph above we see that most of the observations associated with large probability values were wrongly classified as Hotel rooms. Similarly, many instances with really low probability values were also wrongly classified.
For the case of Private rooms, results are unclear. We observe a high dispersion of accurate classifications as well as errors across the plane.
Finally, in the case of Shared rooms, we observe that the majority of instances located in the high probability region were misclassified.
Linear regression is used to approximate the linear relationship between a continuous response variable and a number of predictor variables. However, when the response variable is categorical, linear regression cannot be used. A small adaptation of linear regression gives rise to the logistic regression method, which is similar in many ways but is used for binary and categorical response variables instead of continuous ones.
Generally the logistic regression method is used for binary variables, but can also be used for multinomial problems. Since the response variable in this case is not binary (it can take on 4 different values), a multinomial logistic regression will be performed.
train = cbind(X_train, Y_train)
names(train)[25] = "roomtype"
lr_train <- multinom(roomtype ~ .,data=train)
lr_test <- predict(lr_train,newdata=X_test)
# Number of rooms classified as each type
summary(lr_test)
## Entire home/apt Hotel room Private room Shared room
## 597 1 336 4
# Confusion table
table(Y_test,lr_test)
## lr_test
## Y_test Entire home/apt Hotel room Private room Shared room
## Entire home/apt 535 0 72 1
## Hotel room 16 0 7 0
## Private room 43 1 251 0
## Shared room 3 0 6 3
# Obtain the Test Error Rate (TER)
lr_TER <- mean(Y_test!=lr_test)
lr_TER
## [1] 0.1588486
The probabilities for each of the room types are plotted below.
As we can see above, the method has a prediction error of 0.1588486, which is already quite good. Let's see if this can be improved by running a backward stepwise algorithm that selects the subset of predictors with the best predictive performance.
# Try to improve the test error rate by deleting predictors without discriminatory power
step_lr_train <- step(lr_train,direction="backward",trace=0)
# Have a look at the variables that have been retained in the model
summary(step_lr_train)$coefnames
## [1] "(Intercept)" "host_total_listings_count"
## [3] "accommodates" "bathrooms"
## [5] "bedrooms" "beds"
## [7] "price" "cleaning_fee"
## [9] "guests_included" "extra_people"
## [11] "minimum_nights" "maximum_nights"
## [13] "availability_30" "number_of_reviews"
## [15] "review_scores_location" "reviews_per_month"
As can be seen above, the algorithm has retained only 15 of the 24 predictors (the first term listed is the intercept). Let's try to make some predictions with this new model.
# Classify the responses in the test data set with this new model
step_lr_test <- predict(step_lr_train,newdata=X_test)
# Number of rooms classified as each type
summary(step_lr_test)
## Entire home/apt Hotel room Private room Shared room
## 602 0 331 5
# Confusion table
table(Y_test,step_lr_test)
## step_lr_test
## Y_test Entire home/apt Hotel room Private room Shared room
## Entire home/apt 538 0 69 1
## Hotel room 16 0 7 0
## Private room 46 0 249 0
## Shared room 2 0 6 4
# Obtain the Test Error Rate (TER)
step_lr_TER <- mean(Y_test!=step_lr_test)
step_lr_TER
## [1] 0.1567164
# This model is slightly better than the original one
And the probabilities of the room types once again:
This newer model has an even lower prediction error of 0.1567164.
We have shown throughout the report different methods and algorithms for performing unsupervised and supervised classification. We have seen that there exist many different techniques to approach each of the two tasks.
Unsupervised classification does not know in advance which the classifying groups are, nor how many of them there are. Thus it is really interesting to see how the algorithms choose the groups in a completely different way than what we could expect. This might be due to the fact that some differences are treated as more meaningful than others. For example, a private room and a hotel room might have different names, but their overall characteristics may be very similar, so an unsupervised classification method could group them together. Overall this approach is normally used when we do not know in advance whether there exist any groups, and how they are split.
Supervised classification is a machine learning technique in which the algorithm trains with a set of predictors and a response variable. In our case the response variable is a categorical one, and there exist many different methods to best predict the category given only a set of predictors. We have shown in the report different methods such as K-nearest-neighbours, models based on the Bayes rule, and finally logistic regression. We have seen that the TER (Test Error Rate) of all models is very similar. But we can never say that one method is always better, so we must try all possible methods and compare their TERs before giving a final answer.
Overall, in this assignment we have learnt and applied new techniques and methods to perform both supervised and unsupervised learning. There are definitely many ways of achieving the goal, and no single method outperforms all others, so trying many of them and comparing is considered one of the best approaches. On a last note, the dataset that we selected for the project is a clear example of an imbalanced dataset: out of the 4 existing classes, two of them represented the vast majority of the sample, and the other two combined accounted for only about 3.5%. Therefore, the proportion of mistakes in the smaller groups tends to be larger than in the larger groups. We decided to leave the proportions of the categories untouched, in order to represent the reality of the dataset.
This report has been done with help of the lecture notes in:
Topic 3, Unsupervised Classification, Pedro Galeano, Department of Statistics UC3M-Santander Big Data Institute Universidad Carlos III de Madrid.
Topic 4, Supervised Classification, Pedro Galeano, Department of Statistics UC3M-Santander Big Data Institute Universidad Carlos III de Madrid.
Some extra comments and information have been taken from: